perm filename MULTID[4,KMC]2 blob
sn#030215 filedate 1973-03-16 generic text, type T, neo UTF8
MULTIDIMENSIONAL ANALYSIS IN EVALUATING A SIMULATION
OF PARANOID THOUGHT PROCESSES

K.M. COLBY AND F.D. HILF

     Once a simulation model reaches a stage of intuitive
adequacy, a model builder should consider using more stringent
evaluation procedures relevant to the model's purposes. For example,
if the model is to serve as a training device, then a simple
evaluation of its pedagogic effectiveness would be sufficient. But
when the model is proposed as an explanation of a psychological
process, more is demanded of the evaluation procedure.
     We shall not describe our model of paranoid processes here.
A description can be found in the literature (Colby, Weber, and Hilf,
1971). We shall concentrate on the evaluation problem, which asks "how
good is the model?" or "how close is the correspondence between the
behavior of the model and the phenomena it is intended to explain?"
Turing's Test has often been suggested as an aid in answering this
question.
     It is very easy to become confused about Turing's Test. In
part this is due to Turing himself, who introduced the now-famous
imitation game in a paper entitled COMPUTING MACHINERY AND
INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
that there are actually two imitation games, the second of which is
commonly called Turing's test.
     In the first imitation game two groups of judges try to
determine which of two interviewees is a woman. Communication between
judge and interviewee is by teletype. Each judge is initially
informed that one of the interviewees is a woman and one a man who
will pretend to be a woman. After the interview, the judge is asked
what we shall call the woman-question, i.e., which interviewee was the
woman? Turing does not say what else the judge is told, but one
assumes the judge is NOT told that a computer is involved, nor is he
asked to determine which interviewee is human and which is the
computer. Thus, the first group of judges would interview two
interviewees: a woman, and a man pretending to be a woman.
     The second group of judges would be given the same initial
instructions, but unbeknownst to them, the two interviewees would be
a woman and a computer programmed to imitate a woman. Both groups
of judges play this game until sufficient statistical data are
collected to show how often the right identification is made. The
crucial question then is: do the judges decide wrongly AS OFTEN when
the game is played with man and woman as when it is played with a
computer substituted for the man? If so, then the program is
considered to have succeeded in imitating a woman as well as a man
imitating a woman. For emphasis we repeat: in asking the
woman-question in this game, judges are not required to identify
which interviewee is human and which is machine.
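The comparison between the two judge groups can be framed statistically as a test on their error rates. As an illustration only (Turing reports no counts; all figures below are hypothetical), the two groups might be compared with a two-proportion z test:

```python
import math

def two_proportion_z(wrong_a, n_a, wrong_b, n_b):
    """z statistic for the difference between two error rates,
    using the pooled standard error."""
    p_a, p_b = wrong_a / n_a, wrong_b / n_b
    pooled = (wrong_a + wrong_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: group A judged man-vs-woman,
# group B (unknowingly) judged program-vs-woman.
z = two_proportion_z(wrong_a=18, n_a=50, wrong_b=22, n_b=50)
```

With these invented counts |z| falls well below 1.96, so the difference in error rates would not reach the conventional 5% significance level and the program would be judged to have imitated a woman about as well as the man did.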
     Later on in his paper Turing proposes a variation of the
first game. In the second game one interviewee is a man and one is a
computer. The judge is asked to determine which is man and which is
machine; we shall call this the machine-question. It is this version
of the game which is commonly thought of as Turing's test. It has
often been suggested as a means of validating computer simulations of
psychological processes.
     In the course of testing a simulation (PARRY) of paranoid
linguistic behavior in a psychiatric interview, we conducted a number
of Turing-like indistinguishability tests (Colby, Hilf, Weber and
Kraemer, 1972). We say `Turing-like' because none of them consisted of
playing the two games described above. We chose not to play these
games for a number of reasons, which can be summarized by saying that
they do not meet modern criteria for good experimental design. In
designing our tests we were primarily interested in learning more
about developing the model. We did not believe the simple
machine-question to be a useful one in serving the purpose of
progressively increasing the credibility of the model, but we
investigated a variation of it to satisfy the curiosity of colleagues
in artificial intelligence.
     In this design eight psychiatrists interviewed two patients
by teletype, using the technique of machine-mediated interviewing,
which involves what we term "non-nonverbal" communication since
non-verbal cues are made impossible (Hilf, 1972). Each judge
interviewed two patients, one being PARRY and one being a hospitalized
paranoid patient. The interviewers were not informed that a
simulation was involved, nor were they asked to identify which was the
machine. Their task was to conduct a diagnostic psychiatric interview
and rate each response from the `patients' along a 0-9 scale of
paranoidness, 0 meaning none and 9 the highest degree. Transcripts of
these interviews, without the ratings of the interviewers, were then
utilized for various experiments in which randomly selected expert
judges conducted evaluations of the interview transcripts. For
example, in one experiment it was found that patients and model were
indistinguishable along the dimension of paranoidness.
     To ask the machine-question, we sent interview transcripts,
one with a patient and one with PARRY, to 100 psychiatrists randomly
selected from the Directory of American Specialists and the Directory
of the American Psychiatric Association. Of the 41 replies, 21 (51%)
made the correct identification while 20 (49%) were wrong. Based on
this random sample of 41 psychiatrists, the 95% confidence interval
is between 35.9% and 66.5%, a range consistent with chance
performance. (Our statistical consultant was Dr. Helena C. Kraemer,
Research Associate in Biostatistics, Department of Psychiatry,
Stanford University.)
     Psychiatrists are considered expert judges of patient
interview behavior, but they are unfamiliar with computers. Hence we
conducted the same test with 100 computer scientists randomly
selected from the membership list of the Association for Computing
Machinery (ACM). Of the 67 replies, 32 (48%) were right and 35 (52%)
were wrong. Based on this random sample of 67 computer scientists,
the 95% confidence interval ranges from 36% to 60%, again close to a
chance level.
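The text does not state how these confidence intervals were computed, but a simple normal-approximation (Wald) interval for a binomial proportion reproduces the reported ranges; a minimal sketch:

```python
import math

def wald_ci(correct, n, z=1.96):
    """95% normal-approximation (Wald) confidence interval for the
    proportion of correct identifications, in percent."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return 100 * (p - half), 100 * (p + half)

# Psychiatrists: 21 of 41 identifications correct
lo, hi = wald_ci(21, 41)
print(f"psychiatrists: {lo:.1f}% to {hi:.1f}%")        # prints 35.9% to 66.5%

# Computer scientists: 32 of 67 identifications correct
lo, hi = wald_ci(32, 67)
print(f"computer scientists: {lo:.1f}% to {hi:.1f}%")  # prints 35.8% to 59.7%
```

Both intervals contain 50%, which is what "close to a chance level" amounts to here: the observed accuracy is statistically indistinguishable from coin-flipping.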
     Thus the answer to the machine-question "can expert judges,
psychiatrists and computer scientists, using teletyped transcripts
of psychiatric interviews, distinguish between paranoid patients and
a simulation of paranoid processes?" is "no". But what do we learn
from this? It is some comfort that the answer was not "yes" and the
null hypothesis (no differences) failed to be rejected, especially
since statistical tests are somewhat biased in favor of rejecting the
null hypothesis (Meehl, 1967). Yet this answer does not tell us what
we would most like to know, i.e., how to improve the model.
Simulation models do not spring forth in a complete, perfect and
final form; they must be gradually developed over time. Perhaps we
might obtain a "yes" answer to the machine-question if we allowed a
large number of expert judges to conduct the interviews themselves
rather than studying transcripts of other interviewers. Such a result
would indicate that the model must be improved, but unless we
systematically investigated how the judges succeeded in making the
discrimination we would not know what aspects of the model to work
on. The logistics of such a design are immense, and obtaining a large
N of judges for sound statistical inference would require an effort
disproportionate to the information-yield.
     A more efficient and informative way to use Turing-like tests
is to ask judges to make ordinal ratings along scaled dimensions from
teletyped interviews. We shall term this approach asking the
dimension-question. One can then compare the scaled ratings received
by the patients and by the model to determine precisely where and by
how much they differ. Model builders strive for a model which
shows indistinguishability along some dimensions and
distinguishability along others. That is, the model converges on what
it is supposed to simulate and diverges from what it is not.
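Asking the dimension-question amounts to comparing, dimension by dimension, the mean rating given to the model with the mean rating given to the patients. A minimal sketch of such a comparison using Welch's t statistic; the ratings below are invented for illustration and are not the study's data:

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic for the difference between the means of
    two independent rating samples (unequal variances allowed)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical 0-9 ratings for one dimension
model_ratings   = [7, 8, 6, 9, 7, 8]
patient_ratings = [5, 6, 4, 6, 5, 7]
t = welch_t(model_ratings, patient_ratings)
# A large positive t marks a dimension on which the model is rated
# higher than the patients, i.e. a place where it needs work.
```

Repeating this per dimension yields exactly the kind of profile of significant and nonsignificant differences reported in the tables below.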
     We mailed paired-interview transcripts to another 400
randomly selected psychiatrists, asking them to rate the responses of
the two `patients' along certain dimensions. The judges were divided
into groups, each judge being asked to rate the responses of each I-O
pair in the interviews along four dimensions. The total number of
dimensions in this test was twelve: linguistic noncomprehension,
thought disorder, organic brain syndrome, bizarreness, anger, fear,
ideas of reference, delusions, mistrust, depression, suspiciousness
and mania. These are dimensions which psychiatrists commonly use in
evaluating patients.
     Table 1 shows there were significant differences, with PARRY
receiving higher scores along the dimensions of linguistic
noncomprehension, thought disorder, bizarreness, anger, mistrust and
suspiciousness. On the dimension of delusions the patients were rated
significantly higher. There were no significant differences along the
dimensions of organic brain syndrome, fear, ideas of reference,
depression and mania.
     While tests asking the machine-question indicate
indistinguishability at the gross level, a study of the finer
structure of the model's behavior through ratings along scaled
dimensions shows statistically significant differences between
patients and model. These differences help the model builder by
suggesting which aspects of the model must be modified and improved
if it is to be considered an adequate simulation of the class of
paranoid patients it is intended to simulate. For example, it is
clear that PARRY's language comprehension must be improved. Once this
has been implemented, a future test will tell us whether improvement
has occurred and by how much in comparison with the earlier version.
Successive identification of particular areas of failure in the model
permits their improvement and the development of more adequate
model-versions.
     Further evidence that the machine-question is too coarse and
insensitive a test comes from the following experiment. In this test
we constructed a random version of the paranoid model which utilized
PARRY's output statements but expressed them randomly, no matter what
the interviewer said. Two psychiatrists conducted interviews with
this model, transcripts of which were paired with patient interviews
and sent to 200 randomly selected psychiatrists, who were asked both
the machine-question and the dimension-question. Of the 69 replies,
34 (49%) were right and 35 (51%) wrong. Based on this random sample
of 69 psychiatrists, the 95% confidence interval ranges from 39% to
63%, again indicating a chance level. However, as shown in Table 2,
significant differences appear along the dimensions of linguistic
noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
rated higher. On these particular dimensions we can construct a
continuum in which the random version represents one extreme and the
actual patients the other. Our (nonrandom) PARRY lies somewhere
between these two extremes, indicating that it performs significantly
better than the random version but still requires improvement before
becoming indistinguishable from patients (see Fig. 1). Table 3
presents t values for the differences between mean ratings of PARRY
and RANDOM-PARRY. (See Table 2 and Fig. 1 for the mean ratings.)
     Thus it can be seen that such a multidimensional analysis
provides yardsticks for measuring the adequacy of this or any other
dialogue simulation model along the relevant dimensions.
     We conclude that when model builders want to conduct tests
which indicate in which direction progress lies, and to obtain a
measure of whether progress is being achieved, the way to use
Turing-like tests is to ask expert judges to make ratings along
multiple dimensions that are essential to the model. Useful tests do
not prove a model; they probe it for its strengths and weaknesses.
Simply asking the machine-question yields little information relevant
to what the model builder most wants to know, namely, along what
dimensions the model must be improved.


REFERENCES

[1] Colby, K.M., Weber, S. and Hilf, F.D., 1971. Artificial paranoia.
    ARTIFICIAL INTELLIGENCE, 2, 1-25.

[2] Colby, K.M., Hilf, F.D., Weber, S. and Kraemer, H.C., 1972.
    Turing-like indistinguishability tests for the validation of a
    computer simulation of paranoid processes. ARTIFICIAL
    INTELLIGENCE, 3, 199-221.

[3] Hilf, F.D., 1972. Non-nonverbal communication and psychiatric
    research. ARCHIVES OF GENERAL PSYCHIATRY, 27, 631-635.

[4] Meehl, P.E., 1967. Theory testing in psychology and physics: a
    methodological paradox. PHILOSOPHY OF SCIENCE, 34, 103-115.

[5] Turing, A., 1950. Computing machinery and intelligence. Reprinted
    in: COMPUTERS AND THOUGHT (Feigenbaum, E.A. and Feldman, J.,
    eds.). McGraw-Hill, New York, 1963, pp. 11-35.